MetaSRA: normalized human sample-specific metadata for the Sequence Read Archive

Authors

Matthew Bernstein

AnHai Doan

Colin N. Dewey

Abstract

Motivation: The NCBI's Sequence Read Archive (SRA) promises great biological insight if one could analyze the data in the aggregate; however, the data remain largely underutilized, in part, due to the poor structure of the metadata associated with each sample. The rules governing submissions to the SRA do not dictate a standardized set of terms that should be used to describe the biological samples from which the sequencing data are derived. As a result, the metadata include many synonyms, spelling variants and references to outside sources of information. Furthermore, manual annotation of the data remains intractable due to the large number of samples in the archive. For these reasons, it has been difficult to perform large-scale analyses that study the relationships between biomolecular processes and phenotype across diverse diseases, tissues and cell types present in the SRA.

Results: We present MetaSRA, a database of normalized SRA human sample-specific metadata following a schema inspired by the metadata organization of the ENCODE project. This schema involves mapping samples to terms in biomedical ontologies, labeling each sample with a sample-type category, and extracting real-valued properties. We automated these tasks via a novel computational pipeline.

Availability and implementation: The MetaSRA is available at metasra.biostat.wisc.edu via both a searchable web interface and bulk downloads. Software implementing our computational pipeline is available at http://github.com/deweylab/metasra-pipeline.

Contact: [email protected].

Supplementary information: Supplementary data are available at Bioinformatics online.
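To make the three-part schema described in the abstract more concrete, the following is a minimal sketch of what a normalized sample record might look like. It is not the authors' pipeline or data format: the class `NormalizedSample`, the function `normalize`, the lookup table `SYNONYM_TO_TERM`, the ontology identifier, the accession and the "tissue" category label are all illustrative assumptions, and the exact-match lookup stands in for whatever ontology-mapping method the real pipeline uses.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical synonym table (assumption): a real system would consult
# full biomedical ontologies rather than a hand-written dictionary.
SYNONYM_TO_TERM = {
    "liver": "UBERON:0002107",
    "hepatic tissue": "UBERON:0002107",
}


@dataclass
class NormalizedSample:
    """Illustrative record with the three parts named in the abstract."""
    accession: str
    ontology_terms: List[str] = field(default_factory=list)   # mapped ontology term IDs
    sample_type: str = "unknown"                               # sample-type category
    real_valued: Dict[str, float] = field(default_factory=dict)  # numeric properties


def normalize(accession: str, raw: Dict[str, str]) -> NormalizedSample:
    """Toy normalization of raw key-value metadata (not the MetaSRA method)."""
    record = NormalizedSample(accession=accession)
    for key, value in raw.items():
        text = value.strip().lower()
        # 1. Map free-text values to ontology terms by exact synonym lookup.
        term = SYNONYM_TO_TERM.get(text)
        if term is not None and term not in record.ontology_terms:
            record.ontology_terms.append(term)
        # 2. Keep values that parse as numbers as real-valued properties.
        try:
            record.real_valued[key.strip().lower()] = float(text)
        except ValueError:
            pass
    # 3. Placeholder category rule (assumption): any mapped term means "tissue".
    if record.ontology_terms:
        record.sample_type = "tissue"
    return record
```

For example, `normalize("SRS000001", {"tissue": "Hepatic Tissue", "age": "45", "source": "liver"})` collapses the two spellings onto a single term and keeps `age` as a number, which is the kind of synonym consolidation the abstract motivates.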

Similar resources

The sequence read archive: explosive growth of sequencing data

New generation sequencing platforms are producing data with significantly higher throughput and lower cost. A portion of this capacity is devoted to individual and community scientific projects. As these projects reach publication, raw sequencing datasets are submitted into the primary next-generation sequence data archive, the Sequence Read Archive (SRA). Archiving experimental data is the key...

Molecular Typing by Polymerase Chain Reaction Sequence Specific Primers (PCR- SSP) of Human Leukocyte Class I and Class II Alleles in a Sample of Iraqi Visceral Leishmaniasis Patients

Objective: This study aimed to investigate the association between HLA alleles and visceral leishmaniasis (VL) in a sample of Iraqi patients. Methods: A total of 30 patients were studied, in addition to 20 age, gender and ethnicity matched controls. All subjects were genotyped by polymerase chain reaction-sequence specific primers (PCR-SSP) method. Results: For HLA-class I region (A and B loci)...

Data and Methods for the Production of National Population Estimates: An Overview and Analysis of Available Metadata

Thomas Spoorenberg Translated by: Elham Fathi Statistical Center of Iran Abstract. Official population estimates can be produced using a variety of data sources and methods. These range from the direct extraction of information from continuously updated population registers to procedures for updating the status of a population enumerated previously in a periodic census. Additional sources and ...

Calculating the quality of public high-throughput sequencing data to obtain a suitable subset for reanalysis from the Sequence Read Archive

It is important for public data repositories to promote the reuse of archived data. In the growing field of omics science, however, the increasing number of submissions of high-throughput sequencing (HTSeq) data to public repositories prevents users from choosing a suitable data set from among the large number of search results. Repository users need to be able to set a threshold to reduce the ...

Journal title:

Volume 33, issue

Pages -

Publication date: 2017
